
reduce the buffer when using high dimensional data in distributed mode. #2485

Merged
merged 17 commits into master from dist-memory-reduce
Oct 15, 2019

Conversation

guolinke
Collaborator

@guolinke guolinke commented Oct 1, 2019

To fix #2484 (Data loading errors with extremely sparse datasets).

@guolinke guolinke requested a review from chivee as a code owner October 1, 2019 15:14
bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
int total_num_feature = Network::GlobalSyncUpByMin(num_col);
size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;

This will still overflow: the multiplication is evaluated in int before the result is assigned to the wider type, so cast the operands first.

Suggested change
size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
size_t esimate_sync_size = static_cast<size_t>(BinMapper::SizeForSpecificBin(config_.max_bin)) * static_cast<size_t>(total_num_feature);
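For illustration, here is a minimal, self-contained demonstration of the overflow the review points out. The numbers are made-up stand-ins (not LightGBM's actual sizes), and signed overflow is formally undefined behavior, so this only shows what typical 64-bit builds do:

#include <cstddef>
#include <cstdio>

int main() {
  int bytes_per_feature = 300;       // hypothetical stand-in for BinMapper::SizeForSpecificBin(...)
  int total_num_feature = 20000000;  // hypothetical synced feature count
  // int * int is evaluated in int, then widened: the product has already
  // overflowed (undefined behavior; wraps on common platforms).
  std::size_t bad = bytes_per_feature * total_num_feature;
  // Widen each operand first, then multiply in size_t: no overflow.
  std::size_t good = static_cast<std::size_t>(bytes_per_feature) *
                     static_cast<std::size_t>(total_num_feature);
  std::printf("bad = %zu, good = %zu\n", bad, good);  // the two values differ
  return 0;
}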

bool force_findbin_in_single_machine = false;
if (Network::num_machines() > 1) {
int total_num_feature = Network::GlobalSyncUpByMin(dataset->num_total_features_);
size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;

Same as above: cast the operands first to avoid overflow.

Suggested change
size_t esimate_sync_size = BinMapper::SizeForSpecificBin(config_.max_bin) * total_num_feature;
size_t esimate_sync_size = static_cast<size_t>(BinMapper::SizeForSpecificBin(config_.max_bin)) * static_cast<size_t>(total_num_feature);

@thvasilo

thvasilo commented Oct 2, 2019

These changes broke something in distributed training: datasets that were fine before now throw an MPI error on an MPI_Recv call somewhere:

salloc -N 3 mpiexec --machinefile /shared/hostnames.txt ../../lightgbm config=train.conf data=/shared/data/avazu-app.val num_trees=3 tree_learner=data
salloc: Granted job allocation 21
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=/shared/kdda.t will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=data will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=/shared/kdda.t will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=data will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=/shared/kdda.t will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=data will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Finished loading data in 2.873416 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 2.883152 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 2.892466 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Total Bins 9926
[LightGBM] [Info] Total Bins 9935
[LightGBM] [Info] Total Bins 9891
[LightGBM] [Info] Number of data: 650820, number of used features: 9901
[LightGBM] [Info] Number of data: 651651, number of used features: 9912
[LightGBM] [Info] Number of data: 651480, number of used features: 9869
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129703 -> initscore=-1.903591
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129677 -> initscore=-1.903820
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129048 -> initscore=-1.909405
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[ip-172-31-59-36:05691] *** An error occurred in MPI_Recv
[ip-172-31-59-36:05691] *** reported by process [478871553,1]
[ip-172-31-59-36:05691] *** on communicator MPI_COMM_WORLD
[ip-172-31-59-36:05691] *** MPI_ERR_TRUNCATE: message truncated
[ip-172-31-59-36:05691] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-59-36:05691] ***    and potentially your MPI job)
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: ip-172-31-56-120
  Local PID:  5770
  Peer host:  ip-172-31-59-36
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 21
salloc: Job allocation 21 has been revoked.

@guolinke
Collaborator Author

guolinke commented Oct 3, 2019

[LightGBM] [Info] Total Bins 9926
[LightGBM] [Info] Total Bins 9935
[LightGBM] [Info] Total Bins 9891

I see that the total number of bins differs across machines.
Did you use the same config on all nodes, without a pre-partitioned dataset?

@guolinke
Collaborator Author

guolinke commented Oct 3, 2019

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129703 -> initscore=-1.903591
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129677 -> initscore=-1.903820
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129048 -> initscore=-1.909405

And the init scores differ as well; it seems the datasets used on these machines are different.

@thvasilo

thvasilo commented Oct 3, 2019

That is indeed weird. My data and LightGBM distribution lie on shared NFS, so all nodes have access to the same data.

As I said, the exact same dataset (avazu-app.val) trains to completion on current master, so I think this PR has some kind of side effect on data loading.

@guolinke
Collaborator Author

guolinke commented Oct 3, 2019

@thvasilo sorry, I realize there is one more place I need to fix too. I will update this soon.

@thvasilo

thvasilo commented Oct 3, 2019

I tried the current PR; avazu-app.val with 1M features now errors out with [LightGBM] [Fatal] Too many features for distributed model, buffer is not enough. It is better to pass categorical feature directly instead of sparse high dimensional feature vectors.

But I'm wondering whether this is a regression or expected behavior, as the same dataset trains fine on master, though as before with different numbers of bins and init values, which I guess is a bug:

salloc -N 3 mpiexec --machinefile hostnames.txt ../../lightgbm config=train.conf data=/shared/data/avazu-app.val num_trees=3 tree_learner=data
salloc: Granted job allocation 12
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Finished loading data in 2.554492 seconds
[LightGBM] [Info] Finished loading data in 2.557398 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 2.561811 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Total Bins 3952
[LightGBM] [Info] Total Bins 3949
[LightGBM] [Info] Total Bins 3952
[LightGBM] [Info] Number of data: 650820, number of used features: 3925
[LightGBM] [Info] Number of data: 651651, number of used features: 3925
[LightGBM] [Info] Number of data: 651480, number of used features: 3925
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129703 -> initscore=-1.903591
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129677 -> initscore=-1.903820
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129048 -> initscore=-1.909405
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.376638
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.37674
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.37546
[LightGBM] [Info] Iteration:1, training auc : 0.73422
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 1.20257
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 0.252487 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:1, training auc : 0.734659
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 1.20257
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 0.252733 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:1, training auc : 0.735349
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 1.20257
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 0.243124 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.369988
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.368788
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.370093
[LightGBM] [Info] Iteration:2, training auc : 0.742183
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 1.21594
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 0.466288 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:2, training auc : 0.741538
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 1.21594
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 0.469050 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:2, training auc : 0.742812
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 1.21594
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 0.457309 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.364572
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.363365
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.364677
[LightGBM] [Info] Iteration:3, training auc : 0.7449
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 1.23792
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 0.698352 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:3, training auc : 0.745682
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 1.23792
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 0.688194 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:3, training auc : 0.744545
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 1.23792
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 0.707128 seconds elapsed, finished iteration 3
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished training
salloc: Relinquishing job allocation 12
salloc: Job allocation 12 has been revoked.

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

@thvasilo did it throw the warning "Communication cost is too large for distributed dataset loading, using single mode instead."?

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

If the warning appears, I think the error is caused by overflow.
BTW, there is a bug in the bin-count calculation on the master branch for distributed mode; it is fixed in this branch.
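For context, here is a minimal sketch of the guard that emits this warning, reconstructed from the diff quoted earlier; kAssumedBufferBytes and AssumedSizePerBinMapper are hypothetical stand-ins, not LightGBM's real identifiers:

#include <cstddef>
#include <cstdio>

const std::size_t kAssumedBufferBytes = 1024u * 1024u * 1024u;  // hypothetical 1 GB cap

// Hypothetical stand-in for BinMapper::SizeForSpecificBin: a few bytes per bin.
std::size_t AssumedSizePerBinMapper(int max_bin) {
  return 16u + 8u * static_cast<std::size_t>(max_bin);
}

// Returns true when syncing the bin mappers across machines would exceed
// the buffer, in which case FindBin runs in single-machine mode instead.
bool ForceSingleMachineFindBin(int num_machines, int total_num_feature, int max_bin) {
  if (num_machines <= 1) return false;
  std::size_t estimate = AssumedSizePerBinMapper(max_bin) *
                         static_cast<std::size_t>(total_num_feature);
  if (estimate > kAssumedBufferBytes) {
    std::printf("Communication cost is too large for distributed dataset "
                "loading, using single mode instead.\n");
    return true;
  }
  return false;
}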

@thvasilo

thvasilo commented Oct 4, 2019

Unfortunately, new errors popped up on the datasets I've tried.

This is for avazu-app.t (1M features)

salloc -N 3 mpiexec --machinefile hostnames.txt ./lightgbm config=examples/mpi/train.conf data=/shared/data/avazu-app.val num_trees=3 tree_learner=data
salloc: Granted job allocation 7
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/avazu-app.val, data=binary.train will be ignored. Current value: data=/shared/data/avazu-app.val
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Finished loading data in 2.444463 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 2.453785 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 2.473461 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Fatal] Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

[LightGBM] [Fatal] Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

[LightGBM] [Fatal] Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

[LightGBM] [Fatal] Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

[LightGBM] [Warning] [LightGBM] [Warning] [LightGBM] [Warning] Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

Check failed: offset == num_total_bin at /shared/LightGBM/src/treelearner/feature_histogram.hpp, line 747 .

<... same errors continue, process exits with error>

And this is for kdda.t (20M features):

salloc -N 3 mpiexec --machinefile hostnames.txt ./lightgbm config=examples/mpi/train.conf data=/shared/data/kdda.t num_trees=3 tree_learner=data
salloc: Granted job allocation 8
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Fatal] Check failed: num_total_features_ == static_cast<int>(bin_mappers->size()) at /shared/LightGBM/src/io/dataset.cpp, line 225 .

[LightGBM] [Fatal] Check failed: num_total_features_ == static_cast<int>(bin_mappers->size()) at /shared/LightGBM/src/io/dataset.cpp, line 225 .

[LightGBM] [Info] Finished loading data in 4.468614 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results

<Process stuck, needed to terminate Ctrl+C>

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

Yeah, actually the master branch is okay, except that the printed information is not right.
My previous fix for it introduced a new bug. Now it should work.

@thvasilo

thvasilo commented Oct 4, 2019

Thanks for the prompt fix!

avazu-app.t trains fine now, but I'm still getting the same error for kdda.t:

 salloc -N 3 mpiexec --machinefile hostnames.txt ./lightgbm config=examples/mpi/train.conf data=/shared/data/kdda.t num_trees=3 tree_learner=data
salloc: Granted job allocation 12
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Fatal] Check failed: num_total_features_ == static_cast<int>(bin_mappers->size()) at /shared/LightGBM/src/io/dataset.cpp, line 225 .

[LightGBM] [Fatal] Check failed: num_total_features_ == static_cast<int>(bin_mappers->size()) at /shared/LightGBM/src/io/dataset.cpp, line 225 .

[LightGBM] [Info] Finished loading data in 4.564516 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
^C
salloc: Relinquishing job allocation 12

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

Thanks! I will find out what caused that check to fail.

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

@thvasilo I think the latest commit should fix the check error. Could you try it again? Thanks very much!

@thvasilo

thvasilo commented Oct 4, 2019

Thanks @guolinke it does indeed work now!

One last question I have is about initscore: is it supposed to be the same for all workers?

Here are some example outputs for kdda:

[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Total Bins 44054
[LightGBM] [Info] Total Bins 44054
[LightGBM] [Info] Total Bins 44054
[LightGBM] [Info] Number of data: 170354, number of used features: 8663
[LightGBM] [Info] Number of data: 170291, number of used features: 8663
[LightGBM] [Info] Number of data: 169657, number of used features: 8663
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.867359 -> initscore=1.877803
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.867297 -> initscore=1.877268
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.868772 -> initscore=1.890142
[LightGBM] [Info] Start training from score 1.881737
[LightGBM] [Info] Start training from score 1.881737
[LightGBM] [Info] Start training from score 1.881737 

and avazu-app.val:

[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Number of positive: 252989, number of negative: 1700962
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Info] Total Bins 7850
[LightGBM] [Info] Number of data: 651480, number of used features: 3925
[LightGBM] [Info] Number of data: 651651, number of used features: 3925
[LightGBM] [Info] Number of data: 650820, number of used features: 3925
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129703 -> initscore=-1.903591
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129048 -> initscore=-1.909405
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129677 -> initscore=-1.903820
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

The init_score is synced. You can see:

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129703 -> initscore=-1.903591
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129048 -> initscore=-1.909405
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.129677 -> initscore=-1.903820
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605
[LightGBM] [Info] Start training from score -1.905605

The last 3 lines are the synced score on the 3 nodes.
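As a quick sanity check (my own arithmetic, not something stated in the PR; it assumes the sync is a plain mean across machines), averaging the three per-node init scores reproduces the shared starting score:

#include <cstdio>

int main() {
  // Per-node init scores copied from the log above.
  double scores[3] = {-1.903591, -1.903820, -1.909405};
  double mean = (scores[0] + scores[1] + scores[2]) / 3.0;
  std::printf("synced init score = %f\n", mean);  // prints -1.905605
  return 0;
}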

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

Thanks @thvasilo very much! Now distributed mode can run efficiently over sparse features too.

@thvasilo

thvasilo commented Oct 4, 2019

One thing I noticed: when setting min_data_in_leaf = 1 I get the following error, but only for kdda.t; avazu-app.val works fine:

[ip-172-31-52-7:06473] *** Process received signal ***
[ip-172-31-52-7:06473] Signal: Segmentation fault (11)
[ip-172-31-52-7:06473] Signal code: Address not mapped (1)
[ip-172-31-52-7:06473] Failing at address: 0x100011
[ip-172-31-52-7:06473] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f2e6c4e6390]
[ip-172-31-52-7:06473] [ 1] ./lightgbm(_ZN8LightGBM16GetConfilctCountERKSt6vectorIbSaIbEEPKiii+0x9)[0x4fd0b9]
[ip-172-31-52-7:06473] [ 2] ./lightgbm(_ZN8LightGBM10FindGroupsERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EERKS0_IiSaIiEEPPiPKimiiib+0x802)[0x50a302]
[ip-172-31-52-7:06473] [ 3] ./lightgbm(_ZN8LightGBM19FastFeatureBundlingERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EEPPiPKimRKS0_IiSaIiEEdiidbb+0x5ee)[0x50c2fe]
[ip-172-31-52-7:06473] [ 4] ./lightgbm(_ZN8LightGBM7Dataset9ConstructEPSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS3_EESaIS6_EEiRKS1_IS1_IdSaIdEESaISB_EEPPiPKimRKNS_6ConfigE+0x23d)[0x50d21d]
[ip-172-31-52-7:06473] [ 5] ./lightgbm(_ZN8LightGBM13DatasetLoader31ConstructBinMappersFromTextDataEiiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEPKNS_6ParserEPNS_7DatasetE+0x1d80)[0x527cf0]
[ip-172-31-52-7:06473] [ 6] ./lightgbm(_ZN8LightGBM13DatasetLoader12LoadFromFileEPKcS2_ii+0x1ed)[0x52aa8d]
[ip-172-31-52-7:06473] [ 7] ./lightgbm(_ZN8LightGBM11Application8LoadDataEv+0x24a)[0x443f4a]
[ip-172-31-52-7:06473] [ 8] ./lightgbm(_ZN8LightGBM11Application9InitTrainEv+0x181)[0x445171]
[ip-172-31-52-7:06473] [ 9] ./lightgbm(main+0x49)[0x440199]
[ip-172-31-52-7:06473] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f2e6c12b830]
[ip-172-31-52-7:06473] [11] ./lightgbm(_start+0x29)[0x442779]
[ip-172-31-52-7:06473] *** End of error message ***

If I set the parameter to 2 (or other values > 1), this does not happen. Seems like a corner case :/

For me the current PR is good enough; just something to keep in mind.

@guolinke
Collaborator Author

guolinke commented Oct 4, 2019

@thvasilo it is not trivial to locate what caused the failure. I added a possible fix; you can give it a try 😄

@thvasilo

thvasilo commented Oct 4, 2019

Now it takes much longer, but I still get an error. I'd just note this as a known issue if it is to be included in the release since, as you said, it would be hard to track down.

[ip-172-31-51-139:08446] *** Process received signal ***
[ip-172-31-51-139:08446] Signal: Segmentation fault (11)
[ip-172-31-51-139:08446] Signal code: Address not mapped (1)
[ip-172-31-51-139:08446] Failing at address: 0x5331
[ip-172-31-51-139:08446] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f9b541a9390]
[ip-172-31-51-139:08446] [ 1] ./lightgbm(_ZN8LightGBM16GetConfilctCountERKSt6vectorIbSaIbEEPKiii+0x9)[0x4fd0b9]
[ip-172-31-51-139:08446] [ 2] ./lightgbm(_ZN8LightGBM10FindGroupsERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EERKS0_IiSaIiEEPPiPKimiiib+0x802)[0x50a302]
[ip-172-31-51-139:08446] [ 3] ./lightgbm(_ZN8LightGBM19FastFeatureBundlingERKSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS2_EESaIS5_EEPPiPKimRKS0_IiSaIiEEdiidbb+0x5ee)[0x50c2fe]
[ip-172-31-51-139:08446] [ 4] ./lightgbm(_ZN8LightGBM7Dataset9ConstructEPSt6vectorISt10unique_ptrINS_9BinMapperESt14default_deleteIS3_EESaIS6_EEiRKS1_IS1_IdSaIdEESaISB_EEPPiPKimRKNS_6ConfigE+0x23d)[0x50d21d]
[ip-172-31-51-139:08446] [ 5] ./lightgbm(_ZN8LightGBM13DatasetLoader31ConstructBinMappersFromTextDataEiiRKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EEPKNS_6ParserEPNS_7DatasetE+0x1d80)[0x527cf0]
[ip-172-31-51-139:08446] [ 6] ./lightgbm(_ZN8LightGBM13DatasetLoader12LoadFromFileEPKcS2_ii+0x1ed)[0x52aa8d]
[ip-172-31-51-139:08446] [ 7] ./lightgbm(_ZN8LightGBM11Application8LoadDataEv+0x24a)[0x443f4a]
[ip-172-31-51-139:08446] [ 8] ./lightgbm(_ZN8LightGBM11Application9InitTrainEv+0x181)[0x445171]
[ip-172-31-51-139:08446] [ 9] ./lightgbm(main+0x49)[0x440199]
[ip-172-31-51-139:08446] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f9b53dee830]
[ip-172-31-51-139:08446] [11] ./lightgbm(_start+0x29)[0x442779]
[ip-172-31-51-139:08446] *** End of error message ***

@guolinke
Collaborator Author

guolinke commented Oct 5, 2019

Thanks, I think I need to debug that. Can the failure be reproduced when running on a single node?

@thvasilo

thvasilo commented Oct 7, 2019

Running on a single node (no MPI parallel training) seems to work fine.

One question I had: when I set min_data_in_leaf = 1, the total number of bins and used features changes. Is this expected behavior? May I ask why that is?

min_data_in_leaf = 1

[LightGBM] [Info] Total Bins 2268113
[LightGBM] [Info] Number of data: 510302, number of used features: 1118008

min_data_in_leaf = 50

[LightGBM] [Info] Total Bins 83942
[LightGBM] [Info] Number of data: 510302, number of used features: 26463

@guolinke
Collaborator Author

guolinke commented Oct 7, 2019

@thvasilo yes, it is expected behavior.
LightGBM pre-prunes features that cannot be split:
when you set min_data_in_leaf to k, any very sparse feature with fewer than k non-zero samples will be pruned. See the sketch below.
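A minimal sketch of that pre-pruning rule as I understand it (my own illustration, not LightGBM's actual code): a feature whose non-zero count is below min_data_in_leaf can never place min_data_in_leaf non-default rows on the sparse side of a split, so it is dropped before training.

#include <vector>

// Hypothetical helper: returns true if a raw feature column survives
// pre-pruning under the rule described above.
bool SurvivesPrePruning(const std::vector<double>& column, int min_data_in_leaf) {
  int non_zero = 0;
  for (double v : column) {
    if (v != 0.0) ++non_zero;
  }
  return non_zero >= min_data_in_leaf;
}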

@guolinke
Collaborator Author

guolinke commented Oct 8, 2019

@thvasilo could you try one more time? If it still fails, I don't think I can fix it for now.

@guolinke guolinke merged commit 40e56ca into master Oct 15, 2019
@StrikerRUS
Collaborator

@guolinke
Just noticed this warning:

[ 43%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/file_io.cpp.o
/__w/1/s/src/io/dataset_loader.cpp: In member function ‘void LightGBM::DatasetLoader::ConstructBinMappersFromTextData(int, int, const std::vector<std::basic_string<char> >&, const LightGBM::Parser*, LightGBM::Dataset*)’:
/__w/1/s/src/io/dataset_loader.cpp:984:49: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       if (sample_values.size() <= start[rank] + i) {
                                                 ^
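One straightforward way to silence that -Wsign-compare warning (a sketch of the obvious fix, not necessarily how it was resolved upstream): do the comparison in the unsigned domain, assuming start[rank] + i is non-negative.

#include <cstddef>
#include <vector>

// Hypothetical extract mirroring the warning site: sample_values.size() is
// std::size_t, while start[rank] + i is int.
bool PastSampledRange(const std::vector<std::vector<double>>& sample_values,
                      const std::vector<int>& start, int rank, int i) {
  return sample_values.size() <= static_cast<std::size_t>(start[rank] + i);
}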

@thvasilo

Hello @guolinke, sorry I got caught up with other stuff. I can confirm that training on kdda.t (20M features) with min_data_in_leaf=1 works as expected.

Example output:

salloc -N 3 mpiexec --machinefile ~/hostnames.txt ../../lightgbm config=train.conf data=/shared/data/kdda.t num_trees=3 tree_learner=data min_data_in_leaf=1                                                                                                                                
salloc: Granted job allocation 21
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] min_data_in_leaf is set=1, min_data_in_leaf=50 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] min_data_in_leaf is set=1, min_data_in_leaf=50 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Warning] data is set=/shared/data/kdda.t, data=binary.train will be ignored. Current value: data=/shared/data/kdda.t
[LightGBM] [Warning] tree_learner is set=data, tree_learner=feature will be ignored. Current value: tree_learner=data
[LightGBM] [Warning] min_data_in_leaf is set=1, min_data_in_leaf=50 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] num_iterations is set=3, num_trees=100 will be ignored. Current value: num_iterations=3
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Local rank: 2, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 0, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Local rank: 1, total number of machines: 3
[LightGBM] [Info] Finished initializing network
[LightGBM] [Info] Finished loading data in 14.300335 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 14.709807 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Finished loading data in 14.874291 seconds
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Number of positive: 442845, number of negative: 67457
[LightGBM] [Info] Total Bins 2050839
[LightGBM] [Info] Total Bins 2050839
[LightGBM] [Info] Total Bins 2050839
[LightGBM] [Info] Number of data: 169657, number of used features: 1010155
[LightGBM] [Info] Number of data: 170291, number of used features: 1010155
[LightGBM] [Info] Number of data: 170354, number of used features: 1010155
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.867297 -> initscore=1.877268
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.868772 -> initscore=1.890142
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.867359 -> initscore=1.877803
[LightGBM] [Info] Start training from score 1.881737
[LightGBM] [Info] Start training from score 1.881737
[LightGBM] [Info] Start training from score 1.881737
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.375075
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.377174
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.37734
[LightGBM] [Info] Iteration:1, training auc : 0.717413
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.967661
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 8.772978 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:1, training auc : 0.718757
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.967661
[LightGBM] [Info] Iteration:1, training auc : 0.719905
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.967661
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 7.957943 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.5
[LightGBM] [Info] 8.864428 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.368703
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.36674
[LightGBM] [Info] Iteration:2, training binary_logloss : 0.36853
[LightGBM] [Info] Iteration:2, training auc : 0.736444
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 0.924669
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 16.853904 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:2, training auc : 0.732943
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 0.924669
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 16.764278 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:2, training auc : 0.734199
[LightGBM] [Info] Iteration:2, valid_1 binary_logloss : 0.924669
[LightGBM] [Info] Iteration:2, valid_1 auc : 0.5
[LightGBM] [Info] 15.952404 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.362164
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.360501
[LightGBM] [Info] Iteration:3, training binary_logloss : 0.362294
[LightGBM] [Info] Iteration:3, training auc : 0.739219
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 0.89998
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 23.937869 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:3, training auc : 0.738611
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 0.89998
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 24.754654 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:3, training auc : 0.742167
[LightGBM] [Info] Iteration:3, valid_1 binary_logloss : 0.89998
[LightGBM] [Info] Iteration:3, valid_1 auc : 0.5
[LightGBM] [Info] 24.845918 seconds elapsed, finished iteration 3
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished training
[LightGBM] [Info] Finished training

@StrikerRUS StrikerRUS deleted the dist-memory-reduce branch October 21, 2019 12:05
@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020